An apparatus and method for performing a MOVHPS-MOVLPS operation on packed data using 
computer-implemented steps is described. In one embodiment, a first packed data 
operand having a pair of data elements is accessed. A second packed data operand 
having two pairs of data elements is then accessed. One of the two pairs of data 
elements in the second packed data operand is replaced with the pair of data elements 
in the first packed data operand. 
TITLE: System and method for performing a MOVHPS-MOVLPS instruction 
Detailed Description Text (3) : 

According to one aspect of the invention, a method and apparatus are described for 
moving data elements in a packed data operand (a MOVHPS-MOVLPS operation). The 
MOVHPS-MOVLPS operations allow, for example, the partial update of a 128-bit packed 
register from memory, and the partial store of the 128-bit register into 64-bit 
memory. This allows the upper half or lower half of the register/memory to be bypassed 
to the destination without modification. This has the benefit of 1) potentially achieving 
the same performance when updating a 128-bit register or memory as a packed 
instruction implementation, and 2) providing the flexibility of loading into the 
packed register from different 64-bit memory locations or storing from two different 
packed memory locations. The two halves of the 128-bit register may be assembled with 
the same performance as the packed instruction which loads or stores an entire 128-bit 
register to/from a unified 128-bit memory location. Being able to access a 64-bit 
quantity, rather than a full 128-bit quantity, is also useful when reorganizing data 
formats. 
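
As an illustrative sketch only (not taken from the patent), the partial update and partial store described above can be modelled in Python by treating the 128-bit register as four 32-bit elements. The helper names `movlps_load`, `movhps_load`, `movlps_store`, and `movhps_store` are hypothetical, and placing elements 0 and 1 in the lower half is an assumption made for the example.

```python
from typing import List, Sequence

Register = List[float]  # four 32-bit elements; index 0 is assumed to be the lowest


def movlps_load(reg: Sequence[float], mem64: Sequence[float]) -> Register:
    """Replace the lower pair; the upper pair passes through unmodified."""
    assert len(reg) == 4 and len(mem64) == 2
    return [mem64[0], mem64[1], reg[2], reg[3]]


def movhps_load(reg: Sequence[float], mem64: Sequence[float]) -> Register:
    """Replace the upper pair; the lower pair passes through unmodified."""
    assert len(reg) == 4 and len(mem64) == 2
    return [reg[0], reg[1], mem64[0], mem64[1]]


def movlps_store(reg: Sequence[float]) -> List[float]:
    """Partial store: only the lower 64 bits go to memory."""
    return [reg[0], reg[1]]


def movhps_store(reg: Sequence[float]) -> List[float]:
    """Partial store: only the upper 64 bits go to memory."""
    return [reg[2], reg[3]]


# Assemble one register from two separate 64-bit memory locations.
xmm = [0.0, 0.0, 0.0, 0.0]
xmm = movlps_load(xmm, [1.0, 2.0])
xmm = movhps_load(xmm, [3.0, 4.0])
```

Each load touches only one half, so two loads from unrelated 64-bit locations build the full register without disturbing the half already written.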

Detailed Description Text (10) : 

The decode unit 140 is shown including packed data instruction set 145 for performing 
operations on packed data. In one embodiment, the packed data instruction set 145 
includes the following instructions: a move instruction(s) 150, a shuffle 
instruction(s) 155, an add instruction(s) (such as ADDPS) 160, and a multiply 
instruction(s) 165. The MOVAPS, SHUFPS and ADDPS instructions are applicable to packed 
floating point data, in which the results of an operation between two sets of numbers 
having a predetermined number of bits are stored in a register having the same 
predetermined number of bits, i.e., the size or configuration of the operand is the 
same as that of the result register. The operation of each of these instructions is 
further described herein. While one embodiment is described in which the packed data 
instructions operate on floating point data, alternative embodiments could 
instead or additionally have similar instructions that operate on integer data. 
ABSTRACT: 

The present invention relates to a data processing unit comprising a register file, a 
register load and store buffer connected to the register file, a single memory, and a 
bus having at least first and second word lines to form a double word wide bus 
coupling the register load and store buffer with said single memory. The register file 
comprises at least two sets of registers, whereby the first set of registers can be 
coupled with one of the word lines and the second set of registers can be coupled with 
the respective other word line. A load and store control unit transfers data from 
or to the memory. 

24 Claims, 9 Drawing figures 
TITLE: Data processing unit with digital signal processing capabilities 

Detailed Description Text (8) : 

A second type of instruction which can be executed according to the present invention 
is a so-called "load two half-words (packed)" instruction. With this instruction one 
word from either data lines 1a or 1d is loaded and split into half-words by units 8 or 
9, and the half-words are placed in the respective lower halves of a word. Optionally, 
units 12 and 13 can either sign-extend or zero-extend the respective half-words to 
words. In other words, in this embodiment, the 16-bit half-words are extended to 32 
bits. Unit 8 or unit 9 splits the word received from lines 1a or 1d into two half-words 
and distributes them through units 12 and 13 to the lower halves of the respective even 
and odd registers. In units 12 and 13 these half-words can be extended to words either 
by filling the upper halves with zeros or by sign-extending the upper halves. If the 
sign of a half-word is negative, the upper half of the respective register is filled 
with "1", otherwise with "0". If units 12 and 13 are deactivated, the half-words are 
stored into the lower halves of the respective even and odd registers without changing 
their upper halves. In a simplified version the least significant memory half-word is 
always stored into an even register and the most significant half-word is stored into 
an odd register adjacent to the even register. 
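
The behaviour described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the patented circuit: the function names and the `mode` parameter are hypothetical, registers are modelled as unsigned 32-bit integers, and the half-word placement follows the simplified version (least significant half-word to the even register).

```python
from typing import Optional, Tuple

WORD_MASK = 0xFFFFFFFF
HALF_MASK = 0xFFFF


def extend_halfword(half: int, signed: bool) -> int:
    """Extend a 16-bit half-word to a 32-bit word (sign- or zero-extension)."""
    half &= HALF_MASK
    if signed and (half & 0x8000):
        return half | 0xFFFF0000  # negative: fill the upper half with ones
    return half                   # otherwise the upper half is zero


def load_two_halfwords(word: int, even_reg: int, odd_reg: int,
                       mode: Optional[str]) -> Tuple[int, int]:
    """Split one memory word into an even/odd register pair.

    mode is 'sign', 'zero', or None (extension units deactivated, so the
    upper halves of both registers are left unchanged).
    """
    low = word & HALF_MASK           # least significant half-word -> even register
    high = (word >> 16) & HALF_MASK  # most significant half-word -> odd register
    if mode is None:
        even = (even_reg & 0xFFFF0000) | low
        odd = (odd_reg & 0xFFFF0000) | high
    else:
        even = extend_halfword(low, mode == "sign")
        odd = extend_halfword(high, mode == "sign")
    return even & WORD_MASK, odd & WORD_MASK


signed_pair = load_two_halfwords(0x80017FFF, 0, 0, "sign")
zero_pair = load_two_halfwords(0x80017FFF, 0, 0, "zero")
kept_pair = load_two_halfwords(0x80017FFF, 0xAAAA0000, 0xBBBB0000, None)
```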



Detailed Description Text (10) : 

A fourth type of instruction which can be executed according to the present invention 
is a so-called "store two half-words (packed)" instruction. With this instruction the 
lower half-words of an even and an odd register are fed to either concatenating unit 
11 or 14. The two half-words are combined into one word and then stored in the memory 
unit 1 through multiplexer 7 or 10 and either data input lines 1b or 1c. 
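
A minimal Python sketch of the concatenation step, assuming that the even register supplies the least significant half-word (mirroring the simplified load case); the function name is hypothetical and the source does not state the ordering explicitly.

```python
def store_two_halfwords(even_reg: int, odd_reg: int) -> int:
    """Combine the lower half-words of an even/odd register pair into one word.

    Assumption: the even register's half-word lands in the low 16 bits.
    """
    low = even_reg & 0xFFFF
    high = odd_reg & 0xFFFF
    return (high << 16) | low


packed_word = store_two_halfwords(0xAAAA7FFF, 0xBBBB8001)
```

The upper halves of both registers are simply discarded, so the store is the inverse of a packed load with the extension units deactivated.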

Detailed Description Text (19) : 

These so-called packed arithmetic or logical instructions partition, in this 
embodiment, a 32 bit word into several identical objects, which can then be fetched, 
stored, and operated on in parallel. These instructions, in particular, allow the full 
exploitation of the 32 bit word of the data processing unit according to the present 
invention in DSP applications. 

Detailed Description Text (21) : 

The loading and storing of packed values into data or address registers is supported 
by the respective load and store instructions described above. The packed objects can 
then be manipulated in parallel by a set of special packed arithmetic instructions 
that perform such arithmetic operations as addition, subtraction, multiplication, 
division, etc. For example, a multiply instruction performs two 16-bit 
multiplications in parallel as shown in FIG. 5. 
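
FIG. 5 is not reproduced in this record, so the following Python sketch only illustrates the general idea of two 16-bit multiplications performed on the halves of 32-bit words. Treating the half-words as signed and returning full-width products are assumptions; `packed_mul16` and `to_signed16` are hypothetical names.

```python
from typing import Tuple


def to_signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def packed_mul16(a: int, b: int) -> Tuple[int, int]:
    """Multiply the low and high half-words of two 32-bit words.

    Returns (low_product, high_product); in hardware both products would be
    computed in parallel.
    """
    low = to_signed16(a) * to_signed16(b)
    high = to_signed16(a >> 16) * to_signed16(b >> 16)
    return low, high


# a holds (high=3, low=-2); b holds (high=5, low=4)
products = packed_mul16((3 << 16) | 0xFFFE, (5 << 16) | 4)
```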
ART-UNIT: 272 

PRIMARY-EXAMINER: Peikari; B. James 

ATTY-AGENT-FIRM: Conley, Rose & Tayon, P.C. Kowert; Robert C. Daffer; Kevin L. 
ABSTRACT: 

A multimedia extension unit (MEU) is provided for performing various multimedia-type 
operations. The MEU can be coupled either through a coprocessor bus or a local CPU bus 
to a conventional processor. The MEU employs vector registers, a vector ALU, and an 
operand routing unit (ORU) to perform a maximum number of the multimedia operations 
within as few instruction cycles as possible. Complex algorithms are readily performed 
by arranging operands upon the vector ALU in accordance with the desired algorithm 
flowgraph. The ORU aligns the operands within partitioned slots or sub-slots of the 
vector registers using vector instructions unique to the MEU. At the output of the 
ORU, operand pairs from vector source or destination registers can be easily routed 
and combined at the vector ALU. The vector instructions employ special load/store 
instructions in combination with numerous operational instructions to carry out 
concurrent multimedia operations on the aligned operands. 

31 Claims, 19 Drawing figures 



File: USPT 
Jan 9, 2001 

DOCUMENT-IDENTIFIER: US 6173366 B1 

** See image for Certificate of Correction ** 

TITLE: Load and store instructions which perform unpacking and packing of data bits in 
separate vector and integer cache storage 

Brief Summary Text (41) : 

Arithmetic scaling which is lacking from many conventional operations is readily 
performed as part of the present load/store instructions. For example, packing and 
unpacking instructions found in many DSP instruction sets can be avoided. Thus, 
unpacking of an 8-bit word into a 20-bit slot occurs as part of a load instruction, 
whereas packing of a 20-bit operand to an 8-bit word occurs as part of the store 
instruction. Combining packing and unpacking operations into store and load helps 
eliminate unnecessary move operations which occur as part of stand-alone conventional 
pack and unpack instructions. 

Detailed Description Text (109) : 

The interleave mapping for 10-bit partitions is completely transparent to the 
programmer as long as only 10-bit loads/stores and vector instructions are performed 
on a given set of data. Interleaved mapping of 20-bit partitions is also transparent 
to the programmer if only 20-bit operations are performed. However, if 10-bit and 
20-bit operations are mixed, then care must be taken to understand the mapping so that 
the expected results are produced. The interleaving can be very useful, for example, 
if a 10-bit load from an octet-sized memory location automatically expands and 
interleaves the byte-wide memory data to the upper portion of 20-bit partitions. The 
20-bit operation can be immediately performed on this data without the need for 
explicit format conversions. Subsequently, 10-bit stores to octets can automatically 
perform the inverse 20-bit to 10-bit packing function. Thus, the present store 
operation, namely vstb mem64, vsh, performs packing of n+4 bits within a slot of a 
vector register to n/2 bits within an address of the memory unit. Given n=16, 
20-bit-to-8-bit packing can occur as part of the store operation. Additional 
operations, such as move or shift operations, need not occur to perform a packing 
function. Packing serves to store the most significant bits from a slot. Unpacking is 
an operation by which n/2 bits from a memory address are loaded into n+4 bit locations 
within a slot. If n=16, then a load operation such as vldb vdh, mem64 causes 8 bits 
within a memory address to be loaded into a 20-bit slot. Utilizing load and store 
functions in such a manner thereby avoids having to implement separate unpack and pack 
instructions, respectively, within the MEU instruction set. Accordingly, the same 
result can be achieved but with fewer instructions. For MPEG, 8-bit pixels are 
unpacked to 20-bit numbers for DCT or IDCT manipulations, then the results are 
repacked to 8-bit pixels. The internals of the DCT and IDCT operations require more 
than 8 bits of precision, for which packing and unpacking are particularly 
advantageous. 
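
The width arithmetic above (n = 16, so n + 4 = 20 slot bits and n/2 = 8 memory bits) can be sketched in Python. This is a simplified model under assumptions: the byte is placed in the upper portion of the slot with the lower 12 bits zero-filled, packing keeps the most significant 8 bits without rounding or saturation, and the slot interleaving is not modelled. The functions `model_vldb` and `model_vstb` are hypothetical stand-ins for the `vldb` and `vstb` mnemonics.

```python
from typing import List

N = 16
SLOT_BITS = N + 4             # 20-bit register slot
MEM_BITS = N // 2             # 8-bit memory element
SHIFT = SLOT_BITS - MEM_BITS  # 12 low-order bits added on load, dropped on store
SLOT_MASK = (1 << SLOT_BITS) - 1


def unpack_on_load(byte: int) -> int:
    """Place an 8-bit value in the upper portion of a 20-bit slot."""
    return (byte & 0xFF) << SHIFT


def pack_on_store(slot: int) -> int:
    """Keep the most significant 8 bits of a 20-bit slot."""
    return (slot & SLOT_MASK) >> SHIFT


def model_vldb(mem64: bytes) -> List[int]:
    """Load eight octets into eight 20-bit slots, unpacking as part of the load."""
    return [unpack_on_load(b) for b in mem64]


def model_vstb(slots: List[int]) -> bytes:
    """Store eight 20-bit slots as eight octets, packing as part of the store."""
    return bytes(pack_on_store(s) for s in slots)


pixels = bytes(range(8))
round_trip = model_vstb(model_vldb(pixels))
```

Because unpacking and packing are folded into the load and store, no separate move or shift step appears between them in this model.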

CLAIMS: 

1. A computer, comprising: 

an input/output device operably coupled to a microprocessor, wherein the 
microprocessor includes: 

an instruction cache configured to store coded first and second sets of instructions 
obtained from the input/output device, wherein said first set of instructions 
comprises integer instructions for operating on integer operands and said second set 
of instructions comprises vector instructions for operating on vector data; 

a decode unit configured to decode said vector instructions; and 

wherein said microprocessor is configured to perform a load of data having a first bit 
size from a first memory location having said first bit size to a register slot having 
a second bit size, wherein said first bit size is smaller than said second bit size, 
and wherein said microprocessor is configured to perform an unpacking operation during 
the load to fill said register slot, wherein said microprocessor is configured to load 
said data and perform said unpacking operation in response to said decode unit 
decoding a single vector load instruction. 

6. The computer as recited in claim 1, wherein said unpacking operation occurs within 
the same instruction cycle as the vector load instruction. 

8. A computer, comprising: 

an input/output device operably coupled to a microprocessor, wherein the 
microprocessor comprises: 

an instruction cache configured to store coded first and second sets of instructions 
obtained from the input/output device, wherein said first set of instructions 
comprises integer instructions for operating on integer operands and said second set 
of instructions comprises vector instructions for operating on vector data; 

a decode unit configured to decode said vector instructions; and 

wherein said microprocessor is configured to perform a store of data having a second 
bit size from a register slot having said second bit size to a first memory location 
having a first bit size, wherein said first bit size is smaller than said second bit 
size, and wherein said microprocessor is configured to perform a packing operation 
during the store on said data to fit said data into said first memory location, 
wherein said microprocessor is configured to store said data and perform said packing 
operation in response to said decode unit decoding a single vector store instruction. 

12. The computer as recited in claim 8, wherein said packing operation occurs within 
the same instruction cycle as the vector store instruction. 

13. A microprocessor, comprising: 

an instruction cache configured to store coded first and second sets of instructions, 
wherein said first set of instructions comprises integer instructions for operating on 
integer operands and said second set of instructions comprises vector instructions for 
operating on vector data; 

a decode unit configured to decode said vector instructions; and 

wherein said microprocessor is configured to perform a load of data having a first bit 
size from a first memory location having said first bit size to a register slot having 
a second bit size, wherein said first bit size is smaller than said second bit size, 
and wherein said microprocessor is configured to perform an unpacking operation during 
the load to fill said register slot, wherein said microprocessor is configured to load 
said data and perform said unpacking operation in response to said decode unit 
decoding a single vector load instruction. 

18. The microprocessor as recited in claim 13, wherein said unpacking operation occurs 
within the same instruction cycle as the vector load instruction. 

24. A microprocessor, comprising: 

an instruction cache configured to store coded first and second sets of instructions, 
wherein said first set of instructions comprises integer instructions for operating on 
integer operands and said second set of instructions comprises vector instructions for 
operating on vector data; 

a decode unit configured to decode said vector instructions; and 

wherein said microprocessor is configured to perform a store of data having a second 
bit size from a register slot having said second bit size to a first memory location 
having a first bit size, wherein said first bit size is smaller than said second bit 
size, and wherein said microprocessor is configured to perform a packing operation 
during the store on said data to fit said data into said first memory location, 
wherein said microprocessor is configured to store said data and perform said packing 
operation in response to said decode unit decoding a single vector store instruction. 

28. The microprocessor as recited in claim 24, wherein said packing operation occurs 
within the same instruction cycle as the vector store instruction. 
ART-UNIT: 273 

PRIMARY-EXAMINER: Treat; William M. 

ATTY-AGENT-FIRM: Law Offices of Peter H. Priest 

ABSTRACT: 

A hierarchical instruction set architecture (ISA) provides pluggable instruction set 
capability and support of array processors. The term pluggable is from the 
programmer's viewpoint and relates to groups of instructions that can easily be added 
to a processor architecture for code density and performance enhancements. One 
specific aspect addressed herein is the unique compacted instruction set which allows 
the programmer the ability to dynamically create a set of compacted instructions on a 
task by task basis for the primary purpose of improving control and parallel code 
density. These compacted instructions are parallelizable in that they are not 
specifically restricted to control code application but can be executed in the 
processing elements (PEs) in an array processor. The ManArray family of processors is 
designed for this dynamic compacted instruction set capability and also supports a 
scalable array of from one to N PEs. In addition, the ManArray ISA is defined as a 
hierarchy of ISAs which allows for future growth in instruction capability and 
supports the packing of multiple instructions within a hierarchy of instructions. 
Detailed Description Text (38) : 

The goal of the packed Load/Store instructions 304 of FIG. 3C is to provide 
high-density code for moving data between SP registers and memory and PE registers and 
their local PE memories. In particular, these instructions facilitate rapid context 
switching for the kernel, and efficient data load/store operations for application 
tasks. The priorities for selecting load/store addressing modes have been established 
in the following order: 
ATTY-AGENT-FIRM: Blakely, Sokoloff, Taylor & Zafman LLP 
ABSTRACT: 

A novel processor for manipulating packed data. The packed data includes a first data 
element D1 and a second data element D2. Each of said data elements has a 
predetermined number of bits. The processor comprises a decoder, a register, and a 
circuit. The decoder is for decoding a control signal responsive to receiving the 
control signal. The register is coupled to the decoder. The register is for storing 
the packed data. The circuit is coupled to the decoder. The circuit is for generating 
a first result data element R1 and a second result data element R2. The circuit is 
further for generating R1 to represent a total number of bits set in D1, and the 
circuit is further for generating R2 to represent a total number of bits set in D2. 
CLAIMS: 

1. A computer-implemented method comprising: 

a) decoding an instruction, the instruction indicating a storage location of a first 
packed data sequence having a set of packed data elements, said instruction operable 
to specify a variable quantity of said packed data elements, said instruction operable 
to specify a variable size of said packed data elements, and said instruction 
specifying an operation to be performed on said packed data elements; 

b) generating, in response to executing said instruction, a result packed data 
sequence having a set of result packed data elements corresponding to said set of 
packed data elements of said first packed data sequence, said result packed data 
elements respectively representing population counts of a number of bits set in said 
packed data elements of said first packed data sequence. 
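
Steps a) and b) can be illustrated with a short Python sketch. It is a software model under assumptions, not the claimed hardware: the packed sequence is held in a single integer with element 0 in the least significant bits, and the name `packed_popcount` and its parameters are hypothetical.

```python
def packed_popcount(packed: int, element_bits: int, count: int) -> int:
    """Return a packed result whose elements are the bit counts of the inputs.

    element_bits and count play the role of the variable element size and
    variable element quantity that the instruction is said to specify.
    """
    mask = (1 << element_bits) - 1
    result = 0
    for i in range(count):
        shift = i * element_bits
        element = (packed >> shift) & mask
        ones = bin(element).count("1")  # population count of one element
        result |= ones << shift         # result element sits in the same position
    return result


two_words = packed_popcount(0x00FF0F01, 16, 2)   # elements 0x0F01 and 0x00FF
four_bytes = packed_popcount(0x00FF0F01, 8, 4)   # elements 0x01, 0x0F, 0xFF, 0x00
```

The same input yields different results for different element sizes, which is why the element size has to be part of the operation.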
ATTY-AGENT-FIRM: Blakely Sokoloff Taylor & Zafman, LLP 

ABSTRACT: 

An emulating agent and method are provided that receive numbers having signs, exponents 
and significands of varying lengths, possibly configured in a variety of 
incompatible formats, and reformat the numbers into a standard uniform format for 
uniform arithmetic computations in processors operating with different architectures. 
In one embodiment, the emulating agent has a three-field superset register configured 
to receive the sign of a number in a first field, the exponent of a number in a second 
field and the significand of a number in a third field, regardless of the original 
format of the number, resulting in a number represented in a standard uniform format 
for computation. The embodiment also allows high level access to the fields to allow 
users to control the size of the numbers inserted into the fields. 

14 Claims, 4 Drawing figures 
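
As a rough illustration of the three-field idea, the Python sketch below splits IEEE 754 single and double precision values into sign, exponent, and significand fields. It is not the patented superset register: the record does not give its field widths, so the sketch only returns the raw fields of each source format, and the names `Fields`, `split_single`, and `split_double` are hypothetical.

```python
import struct
from typing import NamedTuple


class Fields(NamedTuple):
    sign: int         # first field
    exponent: int     # second field (biased, as stored)
    significand: int  # third field (stored fraction bits only)


def split_single(x: float) -> Fields:
    """Split an IEEE 754 single-precision value (1/8/23 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return Fields(bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF)


def split_double(x: float) -> Fields:
    """Split an IEEE 754 double-precision value (1/11/52 bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return Fields(bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1))


single_fields = split_single(-1.5)
double_fields = split_double(-1.5)
```

The same number arrives with different exponent biases and significand lengths, which is the incompatibility a uniform internal format would have to absorb.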
Detailed Description Text (21) : 

Memory access instructions are required in order for proper format conversion. Table 4 
illustrates a sample of instructions for memory access in different operations 
involved in the format conversion. There are separate floating-point load and store 
instructions for the single, double and double extended floating-point real data types 
and the packed signed or unsigned integer data. In a preferred embodiment, the 
addressing modes and memory hint options for floating-point load and store 
instructions are the same as for the integer load and store instructions. Table 4 
illustrates a list of sample floating-point load/store instructions. 
ART-UNIT: 278 

PRIMARY-EXAMINER: Maung; Zarni 

ATTY-AGENT-FIRM: Conley, Rose & Tayon Kowert; Robert C. Daffer; Kevin L. 
ABSTRACT: 

A multimedia extension unit (MEU) is provided for performing various multimedia-type 
operations. The MEU can be coupled either through a coprocessor bus or a local CPU bus 
to a conventional processor. The MEU employs vector registers, a vector ALU, and an 
operand routing unit (ORU) to perform a maximum number of the multimedia operations 
within as few instruction cycles as possible. Complex algorithms are readily performed 
by arranging operands upon the vector ALU in accordance with the desired algorithm 
flowgraph. The ORU aligns the operands within partitioned slots or sub-slots of the 
vector registers using vector instructions unique to the MEU. At the output of the 
ORU, operand pairs from vector source or destination registers can be easily routed 
and combined at the vector ALU. The vector instructions employ special load/store 
instructions in combination with numerous operational instructions to carry out 
concurrent multimedia operations on the aligned operands. 
TITLE: System and method for conditionally moving an operand from a source register to 
a destination register 

Brief Summary Text (41) : 

Arithmetic scaling which is lacking from many conventional operations is readily 
performed as part of the present load/store instructions. For example, packing and 
unpacking instructions found in many DSP instruction sets can be avoided. Thus, 
unpacking of an 8-bit word into a 20-bit slot occurs as part of a load instruction, 
whereas packing of a 20-bit operand to an 8-bit word occurs as part of the store 
instruction. Combining packing and unpacking operations into store and load helps 
eliminate unnecessary move operations which occur as part of stand-alone conventional 
pack and unpack instructions. 

Detailed Description Text (116) : 

The interleave mapping for 10-bit partitions is completely transparent to the 
programmer as long as only 10-bit loads/stores and vector instructions are performed 
on a given set of data. Interleaved mapping of 20-bit partitions is also transparent 
to the programmer if only 20-bit operations are performed. However, if 10-bit and 
20-bit operations are mixed, then care must be taken to understand the mapping so that 
the expected results are produced. The interleaving can be very useful, for example, 
if a 10-bit load from an octet-sized memory location automatically expands and 
interleaves the byte-wide memory data to the upper portion of 20-bit partitions. The 
20-bit operation can be immediately performed on this data without the need for 
explicit format conversions. Subsequently, 10-bit stores to octets can automatically 
perform the inverse 20-bit to 10-bit packing function. Thus, the present store 
operation, namely vstb mem64, vsh, performs packing of n+4 bits within a slot of a 
vector register to n/2 bits within an address of the memory unit. Given n=16, 
20-bit-to-8-bit packing can occur as part of the store operation. Additional 
operations, such as move or shift operations, need not occur to perform a packing 
function. Packing serves to store the most significant bits from a slot. Unpacking is 
an operation by which n/2 bits from a memory address are loaded into n+4 bit locations 
within a slot. If n=16, then a load operation such as vldb vdh, mem64 causes 8 bits 
within a memory address to be loaded into a 20-bit slot. Utilizing load and store 
functions in such a manner thereby avoids having to implement separate unpack and pack 
instructions, respectively, within the MEU instruction set. Accordingly, the same 
result can be achieved but with fewer instructions. For MPEG, 8-bit pixels are 
unpacked to 20-bit numbers for DCT or IDCT manipulations, then the results are 
repacked to 8-bit pixels. The internals of the DCT and IDCT operations require more 
than 8 bits of precision, for which packing and unpacking are particularly 
advantageous. 
ABSTRACT: 

An improved data processor control subsystem in which a cycle counter having a 
plurality of cascade-connected stages also comprises one or more supplemental or dummy 
stages, which can be selectively inserted or removed from the chain of 
cascade-connected stages, to alter the number of sub-cycles in an operating cycle, 
thereby decreasing the complexity of associated decoding circuitry. 
Detailed Description Text (20) : 

The example given above of an instruction by which data are transferred 
from main store 1 to local store 28, in either unpacked or packed form, shows that 
for otherwise identical processes the two types of microinstructions differ only in 
the one cycle time in which the data are transformed from an unpacked into a packed form. 

Detailed Description Text (29) : 

The specific feature of instruction cycle counter 22a is that this counter, 
which is designed, for example, for the simple control of a microinstruction that 
fetches data from the main store and transfers them to the local store (the conversion 
of the data from an unpacked into a packed form possibly being included in the 
control), also comprises an additional flipflop 54 that is activated only upon request 
and generates the additional cycle time TZ. Output lines 59 of the respective stages are 
connected to the various gates of the data flow, where the various cycle times, 
combined with the output signals of operation decoder 15, perform the control actions 
in the execution of the respective microinstruction. The combination of the control 
signals, i.e. of the output signals of operation decoder 15, with the respective cycle 
times is not shown in detail in FIG. 1 but can be inferred from FIG. 3. 



Detailed Description Text (32) : 

Microinstructions whose data cover a longer path from source to origin, as e.g. in an 
instruction which could be: "Fetch decimal data, convert them into the packed form and 
transfer them to local store", generate a control signal on line 57, so that now with 
an enabled gate 52 the additional flipflop 54 can be inserted via OR gate 55 into the 
flipflop chain between the flip-flop for cycle time T4 and the flipflop for cycle time 
T5. The direct path of the activation signal is blocked via inverter 51 and the not 
enabled gate 53. In this manner, the additional cycle time required for converting the 
decimal data into a packed form is generated for the data propagation on the longer 
path. 
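
The effect of the optional stage on the sequence of cycle times can be sketched in Python. This models only the ordering of cycle times, not the gates; the stage names other than T4, T5, and TZ, the total number of stages, and the function name `cycle_sequence` are assumptions made for the example.

```python
from typing import List, Sequence


def cycle_sequence(base_stages: Sequence[str], insert_after: str,
                   extra_requested: bool) -> List[str]:
    """Return the cycle times produced by one pass through the counter chain.

    When extra_requested is True (the control signal on line 57 in the text),
    the dummy stage TZ is spliced in after insert_after; otherwise the chain
    passes directly to the next regular stage.
    """
    sequence: List[str] = []
    for stage in base_stages:
        sequence.append(stage)
        if extra_requested and stage == insert_after:
            sequence.append("TZ")  # additional cycle time for the longer data path
    return sequence


stages = ["T1", "T2", "T3", "T4", "T5", "T6"]  # hypothetical chain length
plain = cycle_sequence(stages, "T4", extra_requested=False)
with_packing = cycle_sequence(stages, "T4", extra_requested=True)
```

Only the length of the operating cycle changes; the regular stages keep their order, which is what keeps the associated decoding simple.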
Hardware facilities are described whereby the handling of data represented by variable 
length fields of bits may be made faster, use less storage and be less prone to errors 
in programming. The bit fields are handled independently of the natural storage 
addressing elements and boundaries. Data may be packed into main storage with the 
highest efficiency, and manipulated with a fast and efficient hardware instruction 
set. 
Drawing Description Text (17) : 

FIG. 17 shows an example of the use of the "load field and increment" instruction to 
access variable length bit fields within packed data; and 
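
FIG. 17 is not reproduced in this record. As a hedged illustration of reading variable length bit fields that ignore byte boundaries, the Python sketch below keeps a bit position that advances after each load. The class name, the method name, and the most-significant-bit-first ordering are assumptions, not details taken from the patent.

```python
class BitFieldReader:
    """Reads consecutive variable length bit fields from packed bytes."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.bit_pos = 0  # current field address, counted in bits

    def load_field_and_increment(self, width: int) -> int:
        """Return the next width-bit field and advance past it."""
        if self.bit_pos + width > len(self.data) * 8:
            raise ValueError("field extends past the end of the data")
        value = 0
        for _ in range(width):
            byte = self.data[self.bit_pos // 8]
            bit = (byte >> (7 - self.bit_pos % 8)) & 1  # MSB-first within a byte
            value = (value << 1) | bit
            self.bit_pos += 1
        return value


reader = BitFieldReader(bytes([0b10110011, 0b01000000]))
first = reader.load_field_and_increment(3)   # bits 0..2
second = reader.load_field_and_increment(7)  # bits 3..9, crossing a byte boundary
```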